How to Maximize an Original Post's Retweet Count

by CaffeineOverflow.

Context

Twitter, as an online social media platform, is a place where we relax, interact with others, and find out what’s happening around the world. Over the past few years it has also been a distinctive channel through which the president of the United States communicates with the public. It is thus without doubt an indispensable part of many people’s daily lives. Naturally, we’d like to figure out what one can do to get a tweet seen by more people and to gain more interactions, and thus to become more influential.

Here we mainly focus on analyzing the factors that covary with an original post's retweet count. The original paper analyzed information diffusion at the user level and at the tweet level.

At the user level, the authors employed multilevel generalized models to predict retweetability and the retweet count received by original tweets. The results show that the numbers of followers and followees are positively associated with retweet count, while the number of reciprocal ties is negatively correlated. However, these results only give the sign of each association. We still have little idea of how exactly the quantities are related: for example, does the retweet count increase linearly with the number of followers? Is there a "threshold effect"? We are also interested in whether posting in English, an international language, helps in getting more retweets. If that is the case, one might choose to post in English!

At the tweet level, they found that the presence of hashtags is positively correlated with retweet count, while the presence of URLs and mentions is negatively correlated. This is interesting: it suggests posting with a hashtag! However, one might also want to know the best time to post. The paper hints that users are more active on weekdays than on weekends and that, within a day, the number of active users peaks at 8 p.m. But do posts published at 8 p.m. really get the most retweets on average? Besides, Christmas is coming: can we get more retweets during holidays?

Let's explore together!

The Data

EgoTimelines: The dataset contains ego users’ dynamic activities, including posting original tweets, retweeting, replying and @-mentioning. It also contains information about the number of retweets, presence of URLs and presence of hashtags for each tweet. The dataset is ideal to analyze the contributions of factors at the post level to the post’s influence.

EgoAlterProfiles: The dataset contains sampled users profiles, including the number of followers, languages, account created time, etc. Combined with the “retweet_count” term in EgoTimelines dataset we can analyze the effect of the number of followers and language (factors on the user level) on the post’s influence.

EgoNetwork: This dataset contains all pairs of ego-alter relationships, where the number of followees for each ego users can be calculated and further analyzed.

Step 1: Read in the relevant datasets and preprocess

Egos have egoIDs ranging from 1 to 34006; alters have IDs ranging from 34007 to 2516190.

Preprocess 0: Keep only the profiles of egos

Preprocess 1: For each user, get the number of followees

(The numbers of followers and reciprocal ties (friends) are already given in the ego profiles.)
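The two preprocessing steps could look roughly like the sketch below. It relies only on the ID ranges stated above; the column names (`ID`, `follower_id`, `followee_id`, `followers_count`) and the tiny in-line tables are hypothetical stand-ins, not the real schema of the dataset files.

```python
import pandas as pd

MAX_EGO_ID = 34006  # egos have IDs 1..34006, as stated in the text

# Hypothetical stand-in for EgoAlterProfiles; real column names may differ.
profiles = pd.DataFrame({
    "ID": [1, 2, 34007, 50000],
    "followers_count": [10, 250, 3, 42],
})

# Hypothetical stand-in for EgoNetwork: one row per "follower follows followee" edge.
network = pd.DataFrame({
    "follower_id": [1, 1, 2, 34007],
    "followee_id": [34007, 50000, 34007, 1],
})

# Preprocess 0: keep only the ego profiles.
ego_profiles = profiles[profiles["ID"] <= MAX_EGO_ID].copy()

# Preprocess 1: followees per ego = number of accounts that ego follows.
followee_counts = network.groupby("follower_id").size()

# Egos that follow nobody are absent from followee_counts, so they get 0.
ego_profiles["followees_count"] = (
    ego_profiles["ID"].map(followee_counts).fillna(0).astype(int)
)
```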

Step 2: Analyze the factors at the user level

Step 2.1: Select the original tweets

Original tweets are statuses that are not replies or retweets.

Step 2.2: Group the original tweets by the egoID
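Steps 2.1 and 2.2 can be sketched in pandas as follows. The column names `replyto_userid` and `retweeted_userid` are assumptions about how replies and retweets are flagged (a missing value meaning "not a reply" / "not a retweet"); the sample rows are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for EgoTimelines; real column names may differ.
tweets = pd.DataFrame({
    "egoID": [1, 1, 1, 2, 2],
    "replyto_userid": [None, 7.0, None, None, None],
    "retweeted_userid": [None, None, 9.0, None, None],
    "retweet_count": [4, 0, 120, 0, 2],
})

# Step 2.1: original tweets are neither replies nor retweets.
is_original = tweets["replyto_userid"].isna() & tweets["retweeted_userid"].isna()
original = tweets[is_original]

# Step 2.2: one row per ego with the mean and median retweet count
# and the number of original tweets behind those statistics.
per_ego = original.groupby("egoID")["retweet_count"].agg(["mean", "median", "count"])
```

Keeping the `count` column next to the averages makes it easy to spot egos whose average rests on only one or two tweets.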

Step 2.3: Plot the average retweet count as a function of the number of followers/followees/friends

First, we noticed that the average retweet counts are not very large: most are below 10, and a few exceed 20. We then separated egos into super influencers and normal users according to their average retweet count and analyzed the two groups separately.

Two users caught our attention here: 1. user 18670, who seems to be a famous person, as she has lots of followers; 2. user 17159 - a user just like us! - with only one follower. So how does she manage to get so many retweets?

From the above analysis, we realize that if you already have many followers, you might not need to do much to get retweets; but if you do not, just like user 17159, it's a smart choice to post with hashtags, which will get your posts seen by more people!

Now for the normal users: to exclude the effects of hashtags (and also URLs and mentions, as they are shown to have certain effects in the original paper), we consider only the original posts with simple plain text.

We noticed that as the follower/followee/friend counts increase (from the first row to the fifth row), the probability of receiving more retweets increases (the plot gets "fatter"). The effect becomes especially pronounced once the count reaches three digits.
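One way to put numbers on this "digits" effect is to bucket users by the number of digits in their follower count and compare the share of users who get any retweets at all. This is only a sketch: reading the five plot rows as digit buckets is our assumption, and the table and its column names are made up.

```python
import pandas as pd

# Hypothetical per-user table: follower count and mean retweets of plain-text posts.
users = pd.DataFrame({
    "followers_count": [3, 8, 45, 70, 150, 900, 2500, 12000],
    "avg_retweets": [0.0, 0.0, 0.0, 0.1, 0.4, 0.0, 1.5, 6.0],
})

# Number of digits in the follower count: 0-9 -> 1, 10-99 -> 2, 100-999 -> 3, ...
users["digits"] = users["followers_count"].astype(str).str.len()

# Per bucket: number of users and share with a non-zero average retweet count.
by_digits = users.groupby("digits")["avg_retweets"].agg(
    n_users="count",
    share_retweeted=lambda s: (s > 0).mean(),
)
```

The same bucketing can be repeated for followee and friend counts by swapping the column.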

We admit that there might be strong multicollinearity among these factors, as shown in the next plot, especially between the number of followers and the number of friends. But our conclusion remains the same: by following more people and being more interactive (getting more friends), a user can hopefully gain more followers and more retweets. Going from one digit to two digits, the user might not see much increase in the retweet count of a plain post, but once the count reaches three digits, they can expect to see a noticeable change.
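A quick way to check for multicollinearity is a pairwise correlation matrix of the three counts. The sketch below uses rank correlation (Pearson correlation on ranks, which is Spearman's rho) because such counts are usually heavy-tailed; the numbers and column names are illustrative only.

```python
import pandas as pd

# Hypothetical per-user counts; column names are assumptions.
users = pd.DataFrame({
    "followers_count": [10, 200, 35, 1200, 80, 640],
    "followees_count": [50, 180, 300, 900, 60, 700],
    "friends_count": [8, 150, 30, 800, 40, 500],
})

# Pearson correlation computed on ranks equals Spearman's rank correlation.
corr = users.rank().corr()
```

Off-diagonal values close to 1 mean that two predictors carry nearly the same information, so their separate regression coefficients should be read with caution.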

Step 2.4: English vs. non-English; matching

We further attempt to study the impact of language by considering language='en' (English) as our "treatment".

First, we noticed that the super influencers include both English and non-English speakers.

Then we create a regressor that predicts median retweet counts for each user.

The coefficient for 'english' can be interpreted as the difference in the predicted value for each one-unit difference in 'english', with the other independent variables held constant. Since 'english' is a categorical variable coded as 0 or 1, a one-unit difference represents switching from one category to the other. So, compared to users who do not speak English, we would expect English speakers' retweet count to be lower by 0.04. However, this coefficient has a p-value of 0.968, which is far from statistically significant. The result suggests that whether or not one posts in an international language like English does not significantly affect the retweet count. (Here we assume that
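To make the interpretation of a 0/1 coefficient concrete, here is a minimal least-squares sketch with NumPy. It is not the regressor used in this analysis: the data are invented, `log_followers` is just one illustrative covariate, and no p-values are computed (a full regression summary, for example from statsmodels, would be needed for those).

```python
import numpy as np

# Hypothetical per-user data: median retweet count, log10 followers, English flag.
median_rt = np.array([0.0, 1.0, 0.0, 2.0, 1.0, 3.0, 0.0, 2.0])
log_followers = np.array([1.0, 2.0, 1.0, 3.0, 2.0, 3.0, 1.0, 2.5])
english = np.array([1, 1, 0, 0, 1, 0, 0, 1])

# Design matrix: intercept, covariate, and the 0/1 treatment indicator.
X = np.column_stack([np.ones_like(median_rt), log_followers, english])
coef, *_ = np.linalg.lstsq(X, median_rt, rcond=None)
intercept, b_followers, b_english = coef

# Predictions for two users who differ only in the English flag.
pred_non_en = np.array([1.0, 2.0, 0.0]) @ coef
pred_en = np.array([1.0, 2.0, 1.0]) @ coef

# Their gap is exactly the coefficient on the indicator.
gap = pred_en - pred_non_en
```

Whatever value `log_followers` is fixed at, the gap between the two predictions equals `b_english`, which is what "switching from one category to the other" means.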

So far we've studied this question very coarsely. Next we examine it more closely: things might differ from language to language, as different populations might have different usage habits.

Looking at the user language data, we can recognize some patterns in the relationship between user language and how actively users retweet. In particular, we can sort the language groups into 3 categories:

But let's not jump to conclusions. Let's take a closer look! A distribution is more informative than a summary statistic.

The fanatics: Arabic speakers have the highest average follower count, and they also have the highest retweet frequency among all language groups. (Here we assume that a high retweet count suggests that speakers of the language are more likely to retweet, and thus more likely to retweet what you have posted.) Therefore, if you happen to be an Arabic speaker, don't hesitate to post in Arabic!

The actives: Dutch speakers and Japanese speakers have one thing in common: they are enthusiastic about tweeting and retweeting (!) even though they don’t necessarily have a high follower count. A possible explanation is that they focus on their favorite followers and keep interacting with them, rather than following many people but remaining inactive. Therefore, if you are a Dutch or Japanese speaker, we would suggest posting in these languages, as your post might have a higher chance of being retweeted by your people!

(Well, maybe Japanese speakers can be categorized as "the fanatics" in terms of their activity pattern: they like following people, tweeting and retweeting! Their friend counts are also crazily high, which suggests that they like interacting with people on Twitter.)

The silent group: Turkish, Russian and German speakers are the exact opposite of the previous category. They have a high average follower count, which means they are active on this social medium. But at the same time they have a low tweet and retweet frequency, which could be because they are more cautious about expressing their opinions (tweeting, or retweeting what others say). Therefore, if you are a Turkish, Russian or German speaker, you might like to post in English to switch your audience to those who are more willing to retweet. :P
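A per-language comparison like the one above can be sketched with a grouped summary plus per-group quantiles, which describe the distribution rather than a single average. The table, the `lang` codes and the `retweets_per_day` column are hypothetical.

```python
import pandas as pd

# Hypothetical per-user table; column names and values are illustrative.
users = pd.DataFrame({
    "lang": ["ar", "ar", "ja", "ja", "de", "de"],
    "followers_count": [900, 1100, 100, 140, 600, 800],
    "retweets_per_day": [2.0, 3.0, 1.5, 2.5, 0.1, 0.3],
})

# One row per language: group size and median of each metric.
summary = users.groupby("lang").agg(
    n_users=("lang", "size"),
    median_followers=("followers_count", "median"),
    median_retweets_per_day=("retweets_per_day", "median"),
)

# Quartiles per language: rows are languages, columns are 0.25 / 0.5 / 0.75.
quantiles = (
    users.groupby("lang")["retweets_per_day"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
```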

Step 3. Analyze the retweet count at the tweet level

Step 3.1 Analyze retweet count for each type of tweet

Retweet patterns may vary across different kinds of tweets, so we first determine the tweet type and plot the distribution for each.

RT_egos are retweets of tweets posted by egos. RT_others are all other retweets. Original tweets are neither replies nor retweets. Displayed in the table are the mean; the 85th, 90th and 95th percentiles; and the tweet count.

We can see that these quantiles are 0 for original posts and replies, but not for retweets. That means only a small fraction of tweets are retweeted. Notice that retweets don't have a retweet count of their own; the value just shows how many times the same original tweet has been retweeted. So we exclude retweets from our retweet count analysis.

We can also see that most retweets are of tweets by users other than egos, which makes sense since egos are just a small sample of the whole population.
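The tweet-type labels and the summary table could be produced along these lines. The sketch assumes that a retweet's original author is stored in a `retweeted_userid` column and a reply's target in `replyto_userid` (both hypothetical names), and that an author ID of at most 34006 marks an ego.

```python
import numpy as np
import pandas as pd

MAX_EGO_ID = 34006  # egos have IDs 1..34006

# Hypothetical stand-in for EgoTimelines; real column names may differ.
tweets = pd.DataFrame({
    "replyto_userid": [np.nan, 5.0, np.nan, np.nan, np.nan, np.nan],
    "retweeted_userid": [np.nan, np.nan, 12.0, 99999.0, np.nan, np.nan],
    "retweet_count": [0, 0, 3, 1500, 0, 25],
})

is_rt = tweets["retweeted_userid"].notna()
is_reply = tweets["replyto_userid"].notna()

# Conditions are checked in order, so retweets are labelled before replies.
tweets["type"] = np.select(
    [is_rt & (tweets["retweeted_userid"] <= MAX_EGO_ID), is_rt, is_reply],
    ["RT_egos", "RT_others", "reply"],
    default="original",
)

# Mean, upper percentiles and tweet count per type.
grouped = tweets.groupby("type")["retweet_count"]
table = grouped.quantile([0.85, 0.90, 0.95]).unstack()
table.columns = ["q85", "q90", "q95"]
table.insert(0, "mean", grouped.mean())
table["count"] = grouped.size()
```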

Step 3.2 Explore if Twitter Usage Changes According to Month

Before checking how tweeting time affects the retweet_count, let's first check how Twitter usage changes over time. The original paper examined the circadian rhythm over the day and the week, so here we examine the rhythm over the year.

According to the supplementary information, only egos with utc_offset information are used to produce the circadian cycles.

Now we group by month to check whether there is any monthly pattern in Twitter usage. We also separate the tweet types here.

Please check out the HTML version for the interactive plot.

Wow, it seems that Twitter usage increases from November through the following October, then suddenly drops to the bottom and starts to increase again. Why is that? Maybe we can check on the whole dataset, since the chance of getting the wrong month because of utc_offset is not so great.

This time we include the year in the plot; maybe not all years have the same pattern.
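Counting tweets per month and per year can be sketched as below, assuming a timestamp column named `created_at` (a hypothetical name) that has already been parsed to datetimes.

```python
import pandas as pd

# Hypothetical timestamps; the real data would come from EgoTimelines.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2013-11-03 08:15", "2013-12-25 20:00", "2014-01-10 09:30",
        "2014-01-22 10:05", "2014-10-30 23:59",
    ]),
})

year = tweets["created_at"].dt.year.rename("year")
month = tweets["created_at"].dt.month.rename("month")

# Rows are months, columns are years; months with no tweets in a year show 0.
monthly = tweets.groupby([year, month]).size().unstack("year", fill_value=0)
```

Laying the years out as separate columns makes an incomplete final year visible as a run of zeros at the end of its column.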

Aha, it turns out that there are more tweets in 2014 than in other years. But the authors only collected data before November 2014. Twitter usage increases throughout the year, but very likely that is just because more people start to use Twitter over time.

We will keep in mind that Twitter usage grows over time and that the data for 2014 are incomplete, and then continue our study of the factors influencing retweet count. We would expect the number of retweets to grow over time as well.

Clearly, they didn't collect all the data for the last day : ) But we can see that the overall trend in the number of retweets is upward.

Step 3.3 Retweet Count and Time of Tweeting

Now let's check how tweeting time (hour and day of the week) affects the retweet_count. We will focus on non_retweets (explained in step 3.1).

Step 3.3.1 Distribution of retweet_count

First, we extract non_retweets (original posts and replies) and check the distribution of retweet count.

Since we want to find out how to maximize retweet count, in what follows we focus on tweets that are retweeted at least 10 times.

We first check how they are distributed.

It is close to a power-law distribution. Now, finally, we will find out whether we can maximize our retweet_count by choosing the right time.
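Besides eyeballing a log-log plot, a rough numerical check is to estimate the power-law exponent by maximum likelihood. The sketch below uses the continuous-data estimator alpha = 1 + n / sum(ln(x_i / x_min)); retweet counts are integers, so this is only an approximation, and the sample here is synthetic rather than taken from the dataset.

```python
import numpy as np

def powerlaw_alpha(values, x_min):
    """Continuous maximum-likelihood estimate of a power-law exponent.

    Uses only observations >= x_min; for integer counts this is an
    approximation, not the exact discrete estimator.
    """
    x = np.asarray(values, dtype=float)
    x = x[x >= x_min]
    return 1.0 + len(x) / np.sum(np.log(x / x_min))

# Synthetic sample with a known exponent of 2.5 and x_min = 10,
# drawn by inverse-transform sampling.
rng = np.random.default_rng(0)
u = rng.random(20000)
sample = 10.0 * (1.0 - u) ** (-1.0 / (2.5 - 1.0))

alpha_hat = powerlaw_alpha(sample, x_min=10)
```

On real data one would pass the retweet counts of the tweets retweeted at least 10 times, with `x_min=10` matching the cutoff used above.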

Step 3.3.2 Daily and Weekly Pattern

We group by day of the week and hour to see whether the pattern for the most retweeted tweets differs from the overall pattern.

From the above figure, we can tell that the number of most retweeted original tweets and replies doesn't follow the general trend of increasing from 5-11h and from 12-21h. Instead, most of these tweets are posted between 9 a.m. and 11 a.m.

But does the posting time actually affect retweet count? We can first check the correlation with a regression model.

We notice that for hours 9 and 10, there's a small but significant positive correlation with retweet count.

We set 9 a.m. and 10 a.m. as the peak hours and further analyze the correlation with varying thresholds on retweet count.

There is a small (0.16) but significant (p < 0.05) positive correlation when N <= 30, which may indicate that if you have just started out on Twitter, you could try tweeting at 9 or 10 a.m.
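The peak-hour check can be sketched as a correlation between a 0/1 "posted at 9 or 10 a.m." flag and the retweet count, computed on a subset of tweets. We assume here that N is an upper bound on the retweet count (on top of the lower bound of 10 used above); that reading, the column names and the sample rows are all assumptions, and no significance test is included.

```python
import numpy as np
import pandas as pd

PEAK_HOURS = (9, 10)

def peak_hour_correlation(df, max_retweets, min_retweets=10):
    """Pearson correlation between a peak-hour flag and retweet count,
    restricted to tweets with min_retweets <= retweet_count <= max_retweets."""
    subset = df[df["retweet_count"].between(min_retweets, max_retweets)]
    is_peak = subset["hour"].isin(PEAK_HOURS).astype(float)
    return np.corrcoef(is_peak, subset["retweet_count"])[0, 1]

# Hypothetical tweets: local posting hour and retweet count.
tweets = pd.DataFrame({
    "hour": [9, 10, 15, 20, 9, 22],
    "retweet_count": [28, 25, 12, 11, 500, 14],
})

r_30 = peak_hour_correlation(tweets, max_retweets=30)
```

Calling the function for a range of `max_retweets` values shows how the correlation changes as more heavily retweeted tweets are included.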

We don't have enough data available to train a decent classifier, so instead of propensity score matching, we pick a few users and examine whether the time of posting influences their retweet count.

User 18670 has 824 tweets that were retweeted more than 10 times. We take a closer look at his/her tweets.

Though user 18670 tweeted very often during the peak hours, the time of the tweet doesn't show an effect on the number of retweets he/she gets. So when you already have a lot of followers (remember, this is the same super influencer we analyzed in step 2.3), maybe you can tweet whenever you want.

Step 3.3.3 Can you get more retweets on holidays?

To analyze the effect of holidays, we choose the most celebrated holiday in English-speaking countries: Christmas. In the analysis below, we keep countries where Christmas is a national public holiday; specifically, users with language en, en-gb, es, fr, nl or pt.

We exclude 2014 from our analysis here, since that year's data are incomplete (see step 3.2).

Group by day of the year, and get the average retweet count for each day.
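A minimal sketch of this grouping is shown below, with hypothetical column names (`created_at`, `lang`, `retweet_count`) and invented rows. It groups on (month, day) instead of the day-of-year number, a small deviation so that leap years stay aligned: December 24 is day 359 in 2012 but day 358 in 2013.

```python
import pandas as pd

CHRISTMAS_LANGS = ["en", "en-gb", "es", "fr", "nl", "pt"]

# Hypothetical tweets joined with the poster's profile language.
tweets = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2012-12-24 18:00", "2013-12-24 09:00", "2013-12-25 12:00",
        "2013-07-01 10:00", "2014-03-01 10:00",
    ]),
    "lang": ["en", "fr", "en", "ja", "en"],
    "retweet_count": [40, 20, 10, 99, 5],
})

# Keep the selected languages and drop the incomplete year 2014.
mask = tweets["lang"].isin(CHRISTMAS_LANGS) & (tweets["created_at"].dt.year != 2014)
subset = tweets[mask]

# Average retweet count per calendar day, pooled across years.
month = subset["created_at"].dt.month.rename("month")
day = subset["created_at"].dt.day.rename("day")
daily_mean = subset.groupby([month, day])["retweet_count"].mean()

best_day = daily_mean.idxmax()  # (month, day) with the highest average
```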

The average number of retweets clearly peaks on Christmas Eve, so that may be a good time to tweet!